5  Loading, explore, wrangle

Now you know

Now, let’s get into business…

In this session, we will

5.1 Make a file

Generate an R Script or R Markdown file.

Name it “loading_management”.

Save it in the R folder.

5.2 Load packages

Before working with data, we load the packages we will use. For now, we will primarily use the tidyverse package (which includes several packages such as dplyr, readr, ggplot2, etc.).

Code
library(tidyverse)

5.3 Loading data

Data can be stored in many different file formats. Each format has advantages and disadvantages — for instance, some are faster to read, while others better preserve variable types.

When working in R, using .rds or .RData files is often more efficient than formats like .csv or .xlsx.

.rds files store a single R object (like a data frame) and preserve its structure, including variable types, factor levels, and attributes. This means you don’t need to re-specify column types or re-clean the data after loading. Moreover, you can save objects (like a model fit) in .rds format, which can be very handy if you’ve run a computationally heavy model. .RData files can store multiple R objects in one file. When you load an .RData file, all saved objects are restored into your environment.

These formats are:

  • Fast to read and write

  • Compact in size

  • Native to R, meaning fewer surprises when loading

In contrast, formats like .csv are plain text and don’t preserve metadata — so you often need to re-define variable types (e.g., dates, factors) after loading.

Use .rds when you want to save and load a single object.

Use .RData when you want to save multiple objects together.

However, you will need toknow how to load different file formats.

Here’s some quick examples on how to read different formats.

Code
#---- .rds
data.rds <- readRDS("path_to_my_file/data.rds")

#---- .RData

#---- .csv
data.csv <- readr::read_csv("path_to_my_file/data.csv")
data.csv <- vroom::vroom("path_to_my_file/data.csv")

#---- .excel
data.excel <- readxl::read_excel("path_to_my_file/data.xlsx")

#---- .sas7bdat
data.sas <- haven::read_sas("path_to_my_file/data.sas7bdat")

#---- Stata
data.stata <- haven::read_dta("path_to_my_file/data.dta")

5.4 Download data

You can download the datasets here.

5.5 Load basic_data and diag_data

Code
basic_data <- readRDS(here::here("data_raw/basic_data.rds"))

diag_data <- readRDS(here::here("data_raw/diag_data.rds"))

5.6 Explore data

5.6.1 Quick overview

5.6.1.1 dim()

The base function dim returns (1) the number of rows; (2) the number of columns.

Code
dim(basic_data)
[1] 1000000       4
Code
dim(diag_data)
[1] 2840870       6

5.6.1.2 colnames()

We can easily explore which columns each dataset contains using the base function colnames().

Code
colnames(basic_data)
[1] "id"         "birth_date" "death_date" "sex"       
Code
colnames(diag_data)
[1] "id"           "contact_type" "adm_date"     "dis_date"     "diag_a"      
[6] "diag_b"      

We prefer to work with variables in lowercase/snakecase. If that was not the case, it is fairly simple to change:

Code
colnames(data) <- tolower(colnames(data))

5.6.1.3 head()/tail()

You can also get a quick overview of the first/last n number of rows using head or tail.

Let’s look at the basic_data

Code
head(basic_data, 10)
   id birth_date death_date sex
1   1 1933-01-01       <NA>   1
2   2 1971-01-01       <NA>   1
3   3 1997-01-01       <NA>   1
4   4 1958-01-01       <NA>   1
5   5 1969-01-01       <NA>   2
6   6 1960-01-02       <NA>   1
7   7 1993-01-01       <NA>   2
8   8 1936-01-02 2023-09-09   2
9   9 2000-01-02       <NA>   2
10 10 1989-01-01       <NA>   1

… and the diag_data

Code
head(diag_data,10)
# A tibble: 10 × 6
   id    contact_type adm_date   dis_date   diag_a diag_b
   <fct> <fct>        <date>     <date>     <chr>  <chr> 
 1 1     1            2013-08-18 2013-08-22 DI12   DI10  
 2 1     2            2013-11-10 2013-11-10 DX06   DE13  
 3 2     2            2011-02-24 2011-02-24 DE13   DI14  
 4 2     2            2014-04-30 2014-04-30 DI15   DE13  
 5 3     1            2018-07-10 2018-07-12 DI50   DX08  
 6 3     1            2018-07-10 2018-07-12 DI50   DX01  
 7 3     2            2021-04-25 2021-04-25 DX03   DX04  
 8 3     1            2024-07-20 2024-07-25 DX08   DI50  
 9 4     2            2014-09-27 2014-09-27 DI12   DX06  
10 4     2            2018-12-08 2018-12-08 DX08   DI15  

5.6.1.4 glimpse()

glimpse is a function from the dplyr package. In contrast to head/tail, glimpse runs columns down the page and data runs across. This can be helpful if your dataset contains several variables.

Code
glimpse(basic_data)
Rows: 1,000,000
Columns: 4
$ id         <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ birth_date <date> 1933-01-01, 1971-01-01, 1997-01-01, 1958-01-01, 1969-01-01…
$ death_date <date> NA, NA, NA, NA, NA, NA, NA, 2023-09-09, NA, NA, NA, NA, NA…
$ sex        <fct> 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2,…
Code
glimpse(diag_data)
Rows: 2,840,870
Columns: 6
$ id           <fct> 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, …
$ contact_type <fct> 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, …
$ adm_date     <date> 2013-08-18, 2013-11-10, 2011-02-24, 2014-04-30, 2018-07-…
$ dis_date     <date> 2013-08-22, 2013-11-10, 2011-02-24, 2014-04-30, 2018-07-…
$ diag_a       <chr> "DI12", "DX06", "DE13", "DI15", "DI50", "DI50", "DX03", "…
$ diag_b       <chr> "DI10", "DE13", "DI14", "DE13", "DX08", "DX01", "DX04", "…

5.7 Choose your language

In R, there are different “languages” or styles of coding for data management. Here are three common ones:

  1. Base R This is the original way of working with data in R. It’s powerful but can be verbose and harder to read.

  2. data.table A high-performance package for large datasets. It’s fast and memory-efficient, but has a steeper learning curve and a unique syntax.

  3. dplyr (part of tidyverse) A modern and readable way to manage data.

    It uses intuitive verbs like select, filter, and mutate to describe what you want to do.

Why we use dplyr in this course:

  • Easy to read and write

  • Code is transparent and step-by-step structured

  • Widely used in health research and data science

  • Actively maintained and developed by the R Studio team

For beginners, dplyr is a great starting point — and it scales well as your skills grow.

5.8 dplyr

dplyr provides a set of intuitive functions for data manipulation, such as selecting columns, filtering rows, creating new variables, and summarising data.

A key feature of dplyr is its use of the pipe operator |>

What is the pipe?

The pipe operator allows you to write code that reads like a sequence of actions. Instead of nesting functions inside each other, you start with your data and then “pipe” it through a series of steps.

Think of it like saying:

  • Take this data → then do this → then do that → then do something else…

This makes your code easier to read and follow.

Example:

Code
data_names |> 
  select(first_name, last_name) |> 
  filter(first_name == "X", 
         last_name == "Y")

.. reads as:

  • take the data_names dataset

  • select the columns first_name and last_name

  • filter rows so first_name is equal to “X” and last_name is equal to “Y”

This structure is consistent across all dplyr functions and helps build a clear and logical workflow.

We will get back to each functions.

The pipe operator can look like |> and %>%. The shortcut is:

  • Windows: Shift + Ctrl + M

  • macOS: Shift + Cmd + M

If you see %>%, you can change it to |> by entering Shift + Crtl + P, write “pipe” and tick of “Use R’s native pipe operator, |>”

Now we will go through some key functions from mainly the dplyr package.

5.9 select() - Choosing columns

Selecting columns is one of the most common tasks in data wrangling. You use select() when you want to focus on specific variables.

When to use select?

  • To reduce clutter and focus on relevant variables
  • To prepare a subset of data for analysis or visualization
  • To reorder columns for readability

Example:

Take the diag_data and select the columns: id, adm_date, and dis_date.

Code
diag_data |> 
  select(id, adm_date, dis_date)
# A tibble: 2,840,870 × 3
   id    adm_date   dis_date  
   <fct> <date>     <date>    
 1 1     2013-08-18 2013-08-22
 2 1     2013-11-10 2013-11-10
 3 2     2011-02-24 2011-02-24
 4 2     2014-04-30 2014-04-30
 5 3     2018-07-10 2018-07-12
 6 3     2018-07-10 2018-07-12
 7 3     2021-04-25 2021-04-25
 8 3     2024-07-20 2024-07-25
 9 4     2014-09-27 2014-09-27
10 4     2018-12-08 2018-12-08
# ℹ 2,840,860 more rows

Helper functions

select() also supports helper functions to match column names. From the tidyr package (loaded with tidyverse), these functions are handy:

  • starts_with("prefix") - selects columns starting with “prefix”

  • ends_with("suffix") - selects columns ending with “suffix”

  • contains("content") - select columns containing “content”.

So, we could have used ends_with() as well, by typing:

Code
diag_data |> 
  select(id, ends_with("date"))
# A tibble: 2,840,870 × 3
   id    adm_date   dis_date  
   <fct> <date>     <date>    
 1 1     2013-08-18 2013-08-22
 2 1     2013-11-10 2013-11-10
 3 2     2011-02-24 2011-02-24
 4 2     2014-04-30 2014-04-30
 5 3     2018-07-10 2018-07-12
 6 3     2018-07-10 2018-07-12
 7 3     2021-04-25 2021-04-25
 8 3     2024-07-20 2024-07-25
 9 4     2014-09-27 2014-09-27
10 4     2018-12-08 2018-12-08
# ℹ 2,840,860 more rows

This can be particular helpful when working with wide datasets.

5.10 filter() - Choosing rows

While select() works on columns, filter() works on rows. You use it to keep only the observations that meet certain conditions. Note that we use == when we specify a condition.

Example:

Code
diag_data |> 
  filter(contact_type == 1)
# A tibble: 799,454 × 6
   id    contact_type adm_date   dis_date   diag_a diag_b
   <fct> <fct>        <date>     <date>     <chr>  <chr> 
 1 1     1            2013-08-18 2013-08-22 DI12   DI10  
 2 3     1            2018-07-10 2018-07-12 DI50   DX08  
 3 3     1            2018-07-10 2018-07-12 DI50   DX01  
 4 3     1            2024-07-20 2024-07-25 DX08   DI50  
 5 5     1            2012-02-05 2012-02-12 DX06   DE78  
 6 5     1            2012-12-23 2012-12-26 DX09   DX07  
 7 6     1            2012-06-10 2012-06-17 DX03   DX06  
 8 6     1            2018-01-14 2018-01-22 DI15   DI10  
 9 6     1            2018-01-14 2018-01-22 DI15   DX04  
10 9     1            2011-03-15 2011-03-28 DE14   DI15  
# ℹ 799,444 more rows

This keeps only rows where contact_type is equal to 1 (inpatient hospital contacts).

You can combine multiple conditions using

  • & for “and”

  • | for “or”

  • ! for “not”

5.11 mutate() - Creating new variables

To add new columns to your dataset, you can use mutate(). This is useful for calculating derived variables, recoding values, or creating labels.

Example:

Code
diag_data |> 
  mutate(
    contact_type_description = 
      case_when(
        contact_type == 1 ~ "Inpatient",
        contact_type == 2 ~ "Outpatient"
      )
  )
# A tibble: 2,840,870 × 7
   id    contact_type adm_date   dis_date   diag_a diag_b contact_type_descrip…¹
   <fct> <fct>        <date>     <date>     <chr>  <chr>  <chr>                 
 1 1     1            2013-08-18 2013-08-22 DI12   DI10   Inpatient             
 2 1     2            2013-11-10 2013-11-10 DX06   DE13   Outpatient            
 3 2     2            2011-02-24 2011-02-24 DE13   DI14   Outpatient            
 4 2     2            2014-04-30 2014-04-30 DI15   DE13   Outpatient            
 5 3     1            2018-07-10 2018-07-12 DI50   DX08   Inpatient             
 6 3     1            2018-07-10 2018-07-12 DI50   DX01   Inpatient             
 7 3     2            2021-04-25 2021-04-25 DX03   DX04   Outpatient            
 8 3     1            2024-07-20 2024-07-25 DX08   DI50   Inpatient             
 9 4     2            2014-09-27 2014-09-27 DI12   DX06   Outpatient            
10 4     2            2018-12-08 2018-12-08 DX08   DI15   Outpatient            
# ℹ 2,840,860 more rows
# ℹ abbreviated name: ¹​contact_type_description

This generates a column describing the contact_type variable. This doesn’t make much sense to do though.. But we will use mutate() sensible later on :)

Here, we used case_when() as well. That function is used to handle multiple conditions clearly.

In the code above, case_when() specifies

  • When contact_type is equal to 1 then assign contact_type_description the character value “Inpatient”

  • When contact_type is equal to 2, assign contact_type_description the character value “Outpatient”.

case_when() also has a “default” option. You can use it like: if this condition is true, do this, otherwise (default) do that. So, we could also have written:

Code
# 01

diag_data |> 
  mutate(
    contact_type_description = 
      case_when(
        contact_type == 1 ~ "Inpatient",
        .default = "Outpatient"
      )
  )

# 02

diag_data |> 
  mutate(
    contact_type_description = 
      case_when(
        contact_type == 1 ~ "Inpatient",
        TRUE ~ "Outpatient"
      )
  )

It is important to note, that case_when() handles the conditions in hierarchy. So, if condition 1 is true, that will overwrite any subsequent conditions. Likewise, if condition 1 is false, but condition 2 is true, that will overwrite a potential condition 3.

Moreover, you must provide the same class of output for each condition. E.g., above we use a character output in each condition. We can’t combine e.g., character output and numeric output.

Code
diag_data |> 
  mutate(
    contact_type_description = 
      case_when(
        contact_type == 1 ~ "Inpatient",
        TRUE ~ 0
      )
  )

5.12 group_by() - Grouping data

Grouping is essential when you want to perform calculations within subgroups (e.g., per patient, per hospital, per year, etc.).

Example:

Code
diag_data |> 
  group_by(id)

This tells R to treat each id as a separate group.

5.13 Excercise

Define inhospital admissions per patient.

  1. Which variables do we need?
  2. Look only at in inpatient contacts.
  3. Define the first recorded hospitalisation as number 1
  4. Also count the total number of hospitalisations per patient.

5.13.1 Choose the columns we need

Code
diag_data |> 
  select(id, contact_type, adm_date)
# A tibble: 2,840,870 × 3
   id    contact_type adm_date  
   <fct> <fct>        <date>    
 1 1     1            2013-08-18
 2 1     2            2013-11-10
 3 2     2            2011-02-24
 4 2     2            2014-04-30
 5 3     1            2018-07-10
 6 3     1            2018-07-10
 7 3     2            2021-04-25
 8 3     1            2024-07-20
 9 4     2            2014-09-27
10 4     2            2018-12-08
# ℹ 2,840,860 more rows

5.13.2 Look only at inpatient hospital contacts

Code
diag_data |> 
  select(id, contact_type, adm_date) |> 
  filter(contact_type == 1)
# A tibble: 799,454 × 3
   id    contact_type adm_date  
   <fct> <fct>        <date>    
 1 1     1            2013-08-18
 2 3     1            2018-07-10
 3 3     1            2018-07-10
 4 3     1            2024-07-20
 5 5     1            2012-02-05
 6 5     1            2012-12-23
 7 6     1            2012-06-10
 8 6     1            2018-01-14
 9 6     1            2018-01-14
10 9     1            2011-03-15
# ℹ 799,444 more rows

5.13.3 Order the first recorded hospitalisation as number 1

Here we make sure that the dataset is correctly ordered per patient.

To do this, we can use the group_by() and arrange() function.

Code
diag_data |> 
  select(id, contact_type, adm_date) |> 
  filter(contact_type == 1) |> 
  group_by(id) |> 
  arrange(id, adm_date)
# A tibble: 799,454 × 3
# Groups:   id [541,052]
   id    contact_type adm_date  
   <fct> <fct>        <date>    
 1 1     1            2013-08-18
 2 3     1            2018-07-10
 3 3     1            2018-07-10
 4 3     1            2024-07-20
 5 5     1            2012-02-05
 6 5     1            2012-12-23
 7 6     1            2012-06-10
 8 6     1            2018-01-14
 9 6     1            2018-01-14
10 9     1            2011-03-15
# ℹ 799,444 more rows

5.13.4 Keep only unique admissions

We can see above that some inpatient contacts occur twice.

To fix this, we can use the distinct() function which keeps only unique rows within the defined groups.

Code
diag_data |> 
  select(id, contact_type, adm_date) |> 
  filter(contact_type == 1) |> 
  group_by(id) |> 
  arrange(id, adm_date) |> 
  distinct()
# A tibble: 696,523 × 3
# Groups:   id [541,052]
   id    contact_type adm_date  
   <fct> <fct>        <date>    
 1 1     1            2013-08-18
 2 3     1            2018-07-10
 3 3     1            2024-07-20
 4 5     1            2012-02-05
 5 5     1            2012-12-23
 6 6     1            2012-06-10
 7 6     1            2018-01-14
 8 9     1            2011-03-15
 9 9     1            2021-06-22
10 11    1            2010-08-31
# ℹ 696,513 more rows

5.13.5 Create an indicator of admission number

Code
diag_data |> 
  select(id, contact_type, adm_date) |> 
  filter(contact_type == 1) |> 
  group_by(id) |> 
  arrange(id, adm_date) |> 
  distinct() |> 
  mutate(adm_nr = row_number())
# A tibble: 696,523 × 4
# Groups:   id [541,052]
   id    contact_type adm_date   adm_nr
   <fct> <fct>        <date>      <int>
 1 1     1            2013-08-18      1
 2 3     1            2018-07-10      1
 3 3     1            2024-07-20      2
 4 5     1            2012-02-05      1
 5 5     1            2012-12-23      2
 6 6     1            2012-06-10      1
 7 6     1            2018-01-14      2
 8 9     1            2011-03-15      1
 9 9     1            2021-06-22      2
10 11    1            2010-08-31      1
# ℹ 696,513 more rows

5.13.6 Create an indicator of total number of admissions

Code
diag_data |> 
  select(id, contact_type, adm_date) |> 
  filter(contact_type == 1) |> 
  group_by(id) |> 
  arrange(id, adm_date) |> 
  distinct() |> 
  mutate(adm_nr = row_number(),
         adm_total = max(adm_nr)
         )
# A tibble: 696,523 × 5
# Groups:   id [541,052]
   id    contact_type adm_date   adm_nr adm_total
   <fct> <fct>        <date>      <int>     <int>
 1 1     1            2013-08-18      1         1
 2 3     1            2018-07-10      1         2
 3 3     1            2024-07-20      2         2
 4 5     1            2012-02-05      1         2
 5 5     1            2012-12-23      2         2
 6 6     1            2012-06-10      1         2
 7 6     1            2018-01-14      2         2
 8 9     1            2011-03-15      1         2
 9 9     1            2021-06-22      2         2
10 11    1            2010-08-31      1         2
# ℹ 696,513 more rows

5.14 Joining dataset

To combine datasets, use join functions:

  • left_join() – keeps all rows from the left dataset

  • right_join() – keeps all rows from the right dataset

  • inner_join() – keeps only matching rows

  • full_join() – keeps all rows from both

When you join datasets, you need to specify

  1. The datasets to join
  2. The columns to join by

The by argument is rather essential. For instance, we can combine data from basic_data and diag_data using id. You can join by several columns if relevant.

Example: Join basic_data and diag_data

Let’s say you are interested in adding information birth date and death date to the diag_data

Code
# 01

left_join(diag_data, basic_data, by = "id")
# A tibble: 2,840,870 × 9
   id    contact_type adm_date   dis_date   diag_a diag_b birth_date death_date
   <fct> <fct>        <date>     <date>     <chr>  <chr>  <date>     <date>    
 1 1     1            2013-08-18 2013-08-22 DI12   DI10   1933-01-01 NA        
 2 1     2            2013-11-10 2013-11-10 DX06   DE13   1933-01-01 NA        
 3 2     2            2011-02-24 2011-02-24 DE13   DI14   1971-01-01 NA        
 4 2     2            2014-04-30 2014-04-30 DI15   DE13   1971-01-01 NA        
 5 3     1            2018-07-10 2018-07-12 DI50   DX08   1997-01-01 NA        
 6 3     1            2018-07-10 2018-07-12 DI50   DX01   1997-01-01 NA        
 7 3     2            2021-04-25 2021-04-25 DX03   DX04   1997-01-01 NA        
 8 3     1            2024-07-20 2024-07-25 DX08   DI50   1997-01-01 NA        
 9 4     2            2014-09-27 2014-09-27 DI12   DX06   1958-01-01 NA        
10 4     2            2018-12-08 2018-12-08 DX08   DI15   1958-01-01 NA        
# ℹ 2,840,860 more rows
# ℹ 1 more variable: sex <fct>
Code
# 02

diag_data |> left_join(basic_data, by = "id")
# A tibble: 2,840,870 × 9
   id    contact_type adm_date   dis_date   diag_a diag_b birth_date death_date
   <fct> <fct>        <date>     <date>     <chr>  <chr>  <date>     <date>    
 1 1     1            2013-08-18 2013-08-22 DI12   DI10   1933-01-01 NA        
 2 1     2            2013-11-10 2013-11-10 DX06   DE13   1933-01-01 NA        
 3 2     2            2011-02-24 2011-02-24 DE13   DI14   1971-01-01 NA        
 4 2     2            2014-04-30 2014-04-30 DI15   DE13   1971-01-01 NA        
 5 3     1            2018-07-10 2018-07-12 DI50   DX08   1997-01-01 NA        
 6 3     1            2018-07-10 2018-07-12 DI50   DX01   1997-01-01 NA        
 7 3     2            2021-04-25 2021-04-25 DX03   DX04   1997-01-01 NA        
 8 3     1            2024-07-20 2024-07-25 DX08   DI50   1997-01-01 NA        
 9 4     2            2014-09-27 2014-09-27 DI12   DX06   1958-01-01 NA        
10 4     2            2018-12-08 2018-12-08 DX08   DI15   1958-01-01 NA        
# ℹ 2,840,860 more rows
# ℹ 1 more variable: sex <fct>

You can also specify which columns you wish to add directly in the join function.

Let’s say we only wanted to add birth_date

Code
diag_data |> 
  left_join(
    basic_data |> select(id, birth_date), 
    by = "id")
# A tibble: 2,840,870 × 7
   id    contact_type adm_date   dis_date   diag_a diag_b birth_date
   <fct> <fct>        <date>     <date>     <chr>  <chr>  <date>    
 1 1     1            2013-08-18 2013-08-22 DI12   DI10   1933-01-01
 2 1     2            2013-11-10 2013-11-10 DX06   DE13   1933-01-01
 3 2     2            2011-02-24 2011-02-24 DE13   DI14   1971-01-01
 4 2     2            2014-04-30 2014-04-30 DI15   DE13   1971-01-01
 5 3     1            2018-07-10 2018-07-12 DI50   DX08   1997-01-01
 6 3     1            2018-07-10 2018-07-12 DI50   DX01   1997-01-01
 7 3     2            2021-04-25 2021-04-25 DX03   DX04   1997-01-01
 8 3     1            2024-07-20 2024-07-25 DX08   DI50   1997-01-01
 9 4     2            2014-09-27 2014-09-27 DI12   DX06   1958-01-01
10 4     2            2018-12-08 2018-12-08 DX08   DI15   1958-01-01
# ℹ 2,840,860 more rows